import pandas as pd # The gold standard of Python data analysis, to create and manipulate tables of data
import numpy as np # The Python module for array processing that pandas is built on
import matplotlib.pyplot as plt # The gold standard of Python data visualization, but can be complex to use
import seaborn as sns; sns.set() # A package to make Matplotlib visualizations more aesthetic
import branca
import geopandas
import folium # package for making maps, please make sure to use a version older than 1.0.0.
from wordcloud import WordCloud # A package that will allow us to make a wordcloud
from scipy.stats import ttest_ind # SciPy's statistics module--we'll stick to t-tests here
from IPython.display import display
from folium.plugins import TimeSliderChoropleth
# from time_slider_choropleth import TimeSliderChoropleth
%matplotlib inline
plt.rcParams["figure.figsize"] = (8,5)
"I think that's how Chicago got started. A bunch of people in New York said, 'Gee, I'm enjoying the crime and the poverty, but it just isn't cold enough. Let's go west." - Richard Jeni
"Chicago is known for good steaks, expensive stores, and beautiful architecture. Unfortunately, the Windy City also enjoys a reputation for corrupt politics [and] violent crime." - Bob Barr
Exploratory data analysis is one of the most important parts of data science. This case builds upon Case 4.3 to encourage further practice, with a focus on technical implementation. We hope that after this case, students are able to think critically and ask logical questions, as well as have the technical capability to investigate those questions. Technical capability in this case mainly refers to utilizing data visualizations, manipulating DataFrames in pandas, and creating custom Python functions to quantify metrics.
Business Context. Congratulations! You were recently promoted to regional Chief of Strategy of the Chicago Police Department. You have many years of experience with field work, but this is your first time having to think about the bigger picture. Chicago is a large city, and your resources are limited. Thus you need to devise a comprehensive plan to enhance the efficiency of police force deployment to fight crime. Making data-driven decisions is essential, even in law enforcement where prior knowledge usually dominates the decision-making process.
Business Problem. Your main task is to explore the data and identify patterns of crime in Chicago, and come up with strategies to efficiently deploy your workforce to fight crime.
Analytical Context. So, you found a dataset available to the Chicago PD from 2017 with information on crimes committed throughout the city. In this case, we will focus on exploratory analysis to construct some preliminary strategies for police deployment. These strategies can be further consolidated or dismissed using more rigorous statistical analysis. One of the key aspects of this case is that our data contains records of crime incidents where often we do not have a clear definition of outcome (such as "severity" of a crime). We will discuss ways of dealing with such data, and how they can be incorporated to have meaningful conclusions.
The case is structured as follows. We will (1) look at univariate summaries; (2) come up with a preliminary strategy based on these; (3) look at joint distributions and revise our strategy; and finally (4) think about changing strategies depending on our priorities and the severity of the crimes.
Let's read in and view our dataset. This dataset is downloaded from this website. It contains reported crime incidents (with the exception of murders, where data exists for each victim) that occurred in the City of Chicago in 2017.
We began Case 4.3 with a basic exploration of the distribution of the various parameters. Since this dataset is more focused on categorical data, we will start by investigating the various frequencies of each category within each parameter:
df = pd.read_csv('Chicago_crime_data.csv', dtype={'ID': object, 'beat_num': object})
df
We can see above that the table contains 22 columns and 268,303 records in total. Since each homicide case can have more than one row, the actual number of cases is smaller than 268,303. Below is a brief description of each column:
| Variable name | Variable description | Note |
|---|---|---|
| ID | Unique identifier for the record | Each victim in a single homicide case is assigned to a different ID |
| Case Number | The Chicago Police Department RD (Records Division) number | Unique to the incident. Multiple IDs can share the same Case Number if the incident is a homicide case |
| Date | Date when the incident occurred | Might be a best estimate for some records |
| Block | The partially redacted address where the incident occurred | The redacted address is in the same block as the actual address |
| IUCR | The Illinois Uniform Crime Reporting code | Directly linked to the primary type and the description of the crime. See details here |
| Primary Type | The primary description of the IUCR code | - |
| Description | The secondary description of the IUCR code | - |
| Location Description | Description of the location where the incident occurred | - |
| Arrest | Whether an arrest was made | - |
| Domestic | Whether the incident was domestic-related | Domestic-related definition is based on the Illinois Domestic Violence Act |
| beat_num | The police beat where the incident occurred | Smallest police geographic area - each beat has a dedicated police beat car. See details here |
| District | The police district where the incident occurred | Three to five beats make up a police sector and three sectors make up a police district. See details here |
| Ward | The ward where the incident occurred | Wards are city council districts. See details here |
| Community Area | The community area where the incident occurred | See details here |
| FBI Code | The crime classification as outlined in the FBI's National Incident-Based Reporting System (NIBRS) | See details here |
| Latitude | The latitude of the location where the incident occurred | This location is shifted from the actual location for partial redaction but falls on the same block |
| Longitude | The longitude of the location where the incident occurred | - |
There are quite a few columns, but most are either: (1) identifying information (e.g. ID, IUCR); or (2) too granular to start with (e.g. latitude, longitude). Therefore, we will first focus on the following variables: primary_type, description, location_description (location types), date (time of occurrence) and beat_num (geographic location), which give valuable information without getting too granular too quickly. Our outcome of interest is the number of crime incidents.
Similar to the last EDA case, it makes sense to explore the relationship of primary_type and description with our outcome of interest, crime incidents. However, we cannot repeat the exact process of looking at pairwise correlations between the variables of interest and the outcome. This is because both primary_type and description are categorical variables: it would not make sense to place them on a scatterplot, and calculating correlations between them is meaningless.
Luckily, both variables are discrete so we can still count the total number of records which belong to a specific category for each of these two variables using a frequency table. Note that primary_type and description are nested variables, meaning each type of primary_type has its own set of descriptions that do not overlap. If two crimes have different primary types then they cannot have the same description by definition.
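This nesting claim is easy to verify: if each description belongs to exactly one primary type, then grouping by Description and counting distinct Primary Type values should always give 1. Below is a minimal sketch on toy data (the rows are invented for illustration; only the column names match the real dataset):

```python
import pandas as pd

# Toy data mimicking the crime table; rows are made up for illustration.
toy = pd.DataFrame({
    "Primary Type": ["THEFT", "THEFT", "BATTERY", "BATTERY", "THEFT"],
    "Description":  ["OVER $500", "RETAIL THEFT", "SIMPLE", "DOMESTIC", "OVER $500"],
})

# If Description is truly nested within Primary Type, every description
# maps back to exactly one primary type.
parents = toy.groupby("Description")["Primary Type"].nunique()
print(parents.max())  # 1 -> the nesting holds in this toy sample
```

Running the same check on the full dataset (replacing `toy` with `df`) would confirm whether the nesting holds there as well.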
df["Primary Type"].value_counts()
We can see the most prevalent primary type of crime is theft, followed by battery and criminal damage. More severe types, such as homicide, arson and human trafficking, are very rare. A more detailed description of crime types is listed in the Description column. We can further break down the above frequencies by Description since Primary Type and Description are nested variables. The resulting frequency table is shown below:
Write code using the groupby function to count the number of cases in all combinations of Primary Type and Description. Then sort the results in decreasing order of the number of cases. Based on the results, what are the most prevalent descriptions of theft, battery and criminal damage cases in Chicago?
df.groupby(["Primary Type","Description"])["ID"].count().reset_index(name="count")\
.sort_values(by="count", ascending = False).reset_index(drop=True).head(20)
Answer. Summarizing the data by this more detailed classification of crime types reveals the prevalent crime descriptions within each primary type.
There are in fact 310 descriptions in total and listing them all here is not viable. However, we can use a visualization tool known as a word cloud to summarize the prevalent descriptions within each primary type. A word cloud visualizes the words within a collection of texts (in our case, the texts are all Descriptions for a specific primary type) and the size of each word is proportional to how often it appears in the dataset. Below, we construct three word clouds for the top 3 most prevalent primary crime types:
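The computation underlying a word cloud is just a word-frequency count. Before building the actual word clouds, the core idea can be sketched with plain Python (the descriptions below are toy examples, and no wordcloud package is needed):

```python
from collections import Counter

# Count how often each word appears across a collection of descriptions.
# A word cloud then draws each word at a size proportional to its count.
descriptions = ["RETAIL THEFT", "THEFT FROM BUILDING", "RETAIL THEFT"]
words = " ".join(descriptions).split()
freq = Counter(words)
print(freq.most_common(2))  # [('THEFT', 3), ('RETAIL', 2)]
```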
# wordcloud for primary type defined by rank
def wordcloud_crime( df, rank ):
    df_filter = df[df["Primary Type"]==df["Primary Type"].value_counts().index[rank]]
    text = ' '.join(df_filter['Description'])
    wordcloud = WordCloud(max_font_size=50, max_words=100, background_color="white").generate(text)
    plt.figure()
    plt.imshow(wordcloud, interpolation="bilinear")
    plt.axis("off")
    plt.show()
print("Crime type: ", df["Primary Type"].value_counts().index[0])
wordcloud_crime( df, 0 )
From the above wordcloud, it seems that the words "building" and "retail" are strongly linked to theft offenses, indicating that theft likely mainly happened indoors and in malls or retail stores.
Use the code above to generate the wordcloud for battery cases and criminal damage cases. What are the most common words to describe these two types of cases?
print("Crime type: ", df["Primary Type"].value_counts().index[1])
wordcloud_crime( df, 1 )
print("Crime type: ", df["Primary Type"].value_counts().index[2])
wordcloud_crime( df, 2 )
Answer. Battery is strongly linked to the word "domestic", implying that battery charges usually involved family members. Criminal damage was strongly associated with the words "property" and "vehicle", indicating the targets of most criminal damage cases.
As we have seen with word clouds, it seems that a given type of crime is usually linked with certain types of locations (e.g. at home, in retail stores). Write code to investigate the crime patterns associated with types of crime locations. Based on the results, which types of locations are more likely to have crime?
Answer: Since Location Description is a discrete variable, we can use the same code as when analyzing Primary Type. We find that streets, residences, apartments and sidewalks account for around 50% of all incidents.
df["Location Description"].value_counts()
So far, we have seen crime patterns linked with Primary Type and Location Description separately. It makes sense to see whether a certain combination of crime type and location type is prevalent or not. We know that both Primary Type and Location Description are discrete variables. We can therefore use a contingency table (cross table) to summarize the total number of incidents that belong to a specific combination of values of Primary Type and Location Description.
We cannot use the previous code where we analyzed Primary Type and Description together since unlike those two variables, Location Description and Primary Type are NOT nested variables. We can use the function crosstab in pandas to generate the contingency table of two variables.
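Before applying crosstab to the full dataset, here is a minimal, self-contained sketch of how it works on toy data (the rows are invented for illustration):

```python
import pandas as pd

# Two non-nested categorical variables; crosstab counts each combination.
toy = pd.DataFrame({
    "Primary Type": ["THEFT", "THEFT", "BATTERY", "THEFT"],
    "Location Description": ["STREET", "RESIDENCE", "STREET", "STREET"],
})
tab = pd.crosstab(toy["Primary Type"], toy["Location Description"])
print(tab)
# crosstab also accepts normalize= (e.g. normalize="index") to turn the
# raw counts into row-wise proportions, which can make comparisons easier.
```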
df_1 = df[df["Location Description"].isin(df["Location Description"].value_counts().index[:10]) & df["Primary Type"].isin(df["Primary Type"].value_counts().index[:10])]
pd.crosstab(df_1["Primary Type"],df_1["Location Description"])
Based on the contingency table above, what are the hot spots for the top 10 most prevalent types of crime? Are they the same or not?
Answer. No, theft is markedly different from the others. Theft is spread widely across many locations, whereas the others are concentrated in a few spots.
How can the above table help you deploy your workforce efficiently?
Answer. This analysis is most useful when we want to identify hot spots for a specific type of crime. For instance, if the police department considers eliminating theft a priority, then the above table can tell us where we should deploy more police force to combat theft.
We now move on to investigate the relationship between crime incidents and time; i.e. the Date variable we pointed out early on. Time is one of the most important dimensions for constructing an effective deployment plan. Since we cannot patrol every location 24/7, we must target periods of time with high crime rates. Date gives us a timestamp for each incident, which allows us to count how many incidents happened within a given period of time. Since we have one year's worth of data, we can start with monthly total incidents to see if certain months are crime-prone.
We have covered a few cases dealing with temporal data, and we generally group by different units of time (days, weeks, months) to discover different insights from the data. As we investigate from a temporal perspective, it is important to keep in mind the concept of confounding variables introduced in Case 4.3. For example, we might discover a pattern where June and July have more crime, when the underlying factor is actually temperature. If Chicago had a particularly hot September in the future, a careful data scientist would expect more crime then, rather than simply concluding that September always has fewer crimes than June/July.
# convert string to datetime type
df["date_py"] = pd.to_datetime(df.Date)
def plot_time( df, time_var, title, rot = 0):
    res = df.groupby([time_var])['ID'].count().reset_index(name="count")
    p = res.plot(x = time_var, y = "count", title = title, rot = rot)
    return p
df["month"] = df.date_py.dt.month
p_monthly = plot_time( df, "month", "2017-monthly crime patterns")
_ = plt.ylabel("Monthly total incidents")
Which months have relatively higher crime rates? Why?
Answer. February has the lowest number of total incidents and crime incidents peak in July. Overall, more incidents occurred during the summer. This makes sense because Chicago is cold and windy in the winter, and neither perpetrators nor victims like to be out and about much then!
Modify your code for monthly total incidents and instead plot the time series of daily total incidents throughout 2017. Do you still believe that February is the time when crime is least concerning?
Answer. One possible solution is given below:
df["date_new"] = df.date_py.dt.date
# rotate the x-axis tick labels for better visualization
p_daily = plot_time( df, "date_new", "2017 daily crime patterns", rot = 45)
_ = plt.ylabel("Daily total incidents")
We can see that most of the days in March had fewer total incidents than most of the days in February. So why do we observe a discrepancy between the monthly chart and daily chart? The reason is simple: February has 28 days whereas March has 31 days, so the monthly total for February is likely to be the smallest.
Therefore, if we want to compare level of crime across different months, monthly total is probably not a good metric since different months have different numbers of days. To resolve this issue, we will normalize the monthly total into some metric that does not depend on the number of days in a month. Normalization of our data is very common and the appropriate method of normalization is generally determined by domain knowledge and the hypothesis generation process which you recently learned in Case 4.3. You want to consistently see how your hypotheses play out in your data by slicing the data in various ways (one being normalizing certain parameters for better comparison).
A natural choice is to divide monthly total by the number of days in a month. The normalized value is indeed the average daily incidents in a month, which can be compared across different months. Let's take a look at the results if we use this normalization. From this, it is clear that March is in fact the least concerning month:
res = df.groupby(["month"])["ID"].count().reset_index(name="count")
res["count"] = res["count"]/[31,28,31,30,31,30,31,31,30,31,30,31]
_ = res.plot(x = "month", y = "count", title = "2017 monthly crime patterns (normalized)")
_ = plt.ylabel("Average daily total incidents")
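As a side note, the hardcoded list of month lengths can be derived with pandas instead, which avoids transcription mistakes and handles leap years automatically; a small sketch:

```python
import pandas as pd

# Number of days in each month of 2017, computed rather than hardcoded.
days = [pd.Period(f"2017-{m:02d}").days_in_month for m in range(1, 13)]
print(days)  # [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]
```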
Choose the correct normalization approach and modify the code above to visualize crime patterns across the days of the week. Which day of the week has the largest number of cases?
Answer. We can count the total number of incidents that happened on each day of the week and plot these counts to examine the patterns we are looking for. But as with the monthly pattern, we need to normalize these raw counts since, for example, 2017 does not contain the same number of Mondays as Sundays. For a given day of the week, we divide the total number of incidents that happened on that day by the number of times that day occurs in 2017. The result is interpreted as the average daily total incidents. Note that this average is taken across the whole year, whereas the monthly averages above were each taken over a single month.
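As a quick sanity check on these denominators, we can count how many times each day of the week occurs in 2017 (the year both starts and ends on a Sunday, so Sundays occur 53 times while every other day occurs 52 times):

```python
import pandas as pd

# Count occurrences of each day of the week in 2017; these counts are the
# denominators for normalizing the raw weekday totals.
days_2017 = pd.Series(pd.date_range("2017-01-01", "2017-12-31", freq="D"))
counts = days_2017.dt.dayofweek.value_counts().sort_index()  # 0=Mon ... 6=Sun
print(counts.tolist())  # [52, 52, 52, 52, 52, 52, 53]
```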
df["dayofweek"] = df.date_py.dt.dayofweek.astype("category")
df.dayofweek = df.dayofweek.cat.rename_categories(["Mon","Tue","Wed","Thu","Fri","Sat","Sun"])
res = df.groupby(["dayofweek"])['ID'].count().reset_index(name="count")
res['count'] = res['count']/pd.date_range("2017-1-1","2017-12-31",freq="D").dayofweek.value_counts()[::-1]
_ = res.plot(x="dayofweek", y = "count", title = "2017 crime patterns in a week")
_ = plt.ylabel("Average daily total incidents")
We find that Friday has significantly more total incidents than any other day of the week, while Wednesday has the fewest.
Another important dimension we need to consider is the relationship between crime incidents and geographic location. The technical aspect of graphing below can be seen as an extension on what we have previously learned about data visualizations. We recommend you slowly review the code in your spare time until you are able to replicate it on a new problem. How we choose to set up these graphs should always be logical and in this case, we are categorizing the data by police beats.
We have rough geographic coordinates for each incident, and based on these we can explore the geographic patterns of crime in Chicago. To identify geographic hot spots of crime, we can partition the City of Chicago into non-overlapping regions and count the total number of cases in 2017 in each region. In this case, we divide Chicago by police beats. We then visualize the results on the map:
# format the beat variable to have leading zeros, count by beat
df["beat_num"] = df["beat_num"].str.zfill(4)
beat_cn = df.groupby("beat_num")["ID"].count().reset_index(name="crime_count")
# color scheme
min_cn, max_cn = beat_cn['crime_count'].quantile([0.01,0.99]).apply(round, 2)
colormap = branca.colormap.LinearColormap(
    colors=['white','yellow','orange','red','darkred'],
    #index=beat_cn['count'].quantile([0.2,0.4,0.6,0.8]),
    vmin=min_cn,
    vmax=max_cn
)
colormap.caption="Total crimes in Chicago by police beats"
# load the shape file for Chicago police beats
beat_orig = geopandas.read_file("Boundaries_beat.geojson", driver = "GeoJSON")
beat_data = beat_orig.join(beat_cn.set_index("beat_num"), how = "left", on = "beat_num")
beat_data.fillna(0, inplace = True)
# interactive visualization for beat-specific crime rate in 2017
m_crime = folium.Map(location=[41.88, -87.63],
                     zoom_start=12,
                     tiles="OpenStreetMap")
style_function = lambda x: {
    'fillColor': colormap(x['properties']['crime_count']),
    'color': 'black',
    'weight': 2,
    'fillOpacity': 0.5
}
stategeo = folium.GeoJson(
    beat_data.to_json(),
    name='Chicago beats',
    style_function=style_function,
    tooltip=folium.GeoJsonTooltip(
        fields=['beat_num', 'crime_count'],
        aliases=['Beat', 'Total crime'],
        localize=True
    )
).add_to(m_crime)
colormap.add_to(m_crime)
m_crime
Overall, we find there are three hot spots in Chicago: Downtown Chicago, West Chicago, and South Chicago. You can hover over each region to see the beat number and the total number of crimes in the beat.
Based on the analysis so far, what are your preliminary strategies for police deployment based on times, locations and types of crimes? What is the potential business problem you are solving here?
Answer: Based on the results, we have several suggestions: deploy more police officers on Fridays and between May and September, and place them in Downtown Chicago as well as the West and South Sides of Chicago. They probably need to focus on streets and residential areas and should pay attention to theft, battery, criminal damage, and assault.
What is the main shortcoming of our analysis and recommendations in 4.1?
Answer: We see two main pitfalls:
The above investigations examined the pattern associated with each variable individually, but these individual patterns might not combine independently. In other words, similar to the previous EDA case, there may exist interaction effects among these variables.
We noticed above that most crimes tend to be theft-related. However, most theft offenses are petty, which might not be the type of crime we should focus our limited police force on. More severe crimes, such as homicide cases, are overlooked in the above analyses since they are very rare. But perhaps these rare and severe crimes are the most important offenses to prevent.
The rest of this case study aims to tackle these two shortcomings.
In Exercise 7.2, we cited two potential problems with naively tacking together the patterns we noticed for each individual variable into a recommendation. We tackle the first issue to start: there may exist interaction effects among the variables of interest. Similar to the previous EDA case, we now investigate each potential interaction effect in more detail and challenge our hypotheses.
Again, we can use a contingency table to answer this question just like we did for Primary Type and Location Description. One catch here is that you need to normalize the data so that the comparisons are fair across different days of the week:
res_raw = pd.crosstab(df["Primary Type"], df.dayofweek)
res_raw/pd.date_range("2017-1-1","2017-12-31",freq="D").dayofweek.value_counts().tolist()[::-1]
Your colleague claims that most thefts happen on Mondays, Tuesdays, or Wednesdays. How do you validate or disprove their claim based on the data and the table above? What can you say about the days which have the most battery and assault incidents?
Answer. From the results we see that Friday has the most theft cases (~10% more than other days) which disproves the first claim. Battery and assault have different patterns. Assault cases are more prevalent on weekdays than weekends (~10% more) while battery cases are more prevalent on weekends (~15% more). So deploying more police forces on Friday might not work for eliminating battery offenses.
The next potential interaction is between crime time and crime location. The geographic hot spots might shift from time to time and targeting different regions at different times is a natural strategy to increase efficiency. The following map shows how the crime rate varies geographically over time. Here, the outcome of interest is average daily total incidents:
def folium_slider( beat_cn, beat_orig, tmp_drange, index_var, index_lab,
                   value_var = "crime_count", caption = "Crimes in Chicago" ):
    # get colorbar
    min_cn, max_cn = beat_cn[value_var].quantile([0.01,0.99]).apply(round, 2)
    colormap = branca.colormap.LinearColormap(
        colors=['white','yellow','orange','red','darkred'],
        #index=beat_cn['count'].quantile([0.2,0.4,0.6,0.8]),
        vmin=min_cn,
        vmax=max_cn
    )
    colormap.caption = caption
    # get styledata for folium
    styledata = {}
    for beat in range(beat_orig.shape[0]):
        res_beat = beat_cn[beat_cn.beat_num==beat_orig.iloc[beat,:].beat_num]
        # fill missing value by zero: no recorded crime that month
        c_count = res_beat.set_index(index_var)[value_var].reindex(tmp_drange).fillna(0)
        df_tmp = pd.DataFrame(
            {'color': [colormap(count) for count in c_count], 'opacity': 0.5},
            index = index_lab
        )
        styledata[str(beat)] = df_tmp
    styledict = {
        str(beat): data.to_dict(orient='index') for
        beat, data in styledata.items()
    }
    # plot map and time slider
    m = folium.Map(location=[41.88, -87.63],
                   zoom_start=12,
                   tiles="OpenStreetMap")
    g = TimeSliderChoropleth(
        beat_orig.to_json(),
        styledict=styledict
    ).add_to(m)
    folium.GeoJson(beat_orig.to_json(), style_function = lambda x: {
        'color': 'black',
        'weight': 2,
        'fillOpacity': 0
    }, tooltip=folium.GeoJsonTooltip(
        fields=['beat_num'],
        aliases=['Beat'],
        localize=True
    )).add_to(m)
    colormap.add_to(m)
    return m
# cycle in a year
beat_cn_month = df.groupby(["beat_num","month"])["ID"].count().reset_index(name = "crime_count")
nd = pd.DataFrame({"month":range(1,13), "days":[31,28,31,30,31,30,31,31,30,31,30,31]})
beat_cn_month = beat_cn_month.merge(nd, how = "left", on = "month")
beat_cn_month["crime_count"] = beat_cn_month["crime_count"]/beat_cn_month["days"]
folium_slider( beat_cn_month, beat_orig, list(range(1,13)), "month",
list(pd.date_range( "2017-1", "2017-12", freq = "MS").strftime("%Y-%m")),
caption = "Average daily total incidents in a month")
From the plot above, what patterns do you observe over time in Downtown Chicago, as well as the West and South Sides of Chicago? Based on this, do we need to refine the strategies we outlined in Exercise 7.1?
Answer. It is clear that Downtown Chicago remains a hot spot no matter which month we are looking at. Beat 0114 is a special region. It has an elevated crime rate only in July and August, probably due to tourists.
West Side Chicago requires particular attention from April to October. Beat 1011 in this area is crime-prone year-round.
South Side Chicago has a hot period from April to August. Beat 0511 in this area has a constantly high crime rate year-round while Beats 0833 and 0834 only have high crime rates in January.
As we can see, the strategies we developed in Exercise 7.1 need refinement. The above observations tell us that we need to deploy in different areas at different times of the year. May to August is a hot period for most of the regions we are concerned about but a few beats are swamped with criminal activity year-round.
Our strategies developed in the previous section suggest that we should pay attention to Fridays as well as the months from May to August. However, is Friday always the most crime-prevalent day of the week regardless of the month we are looking at? Using the code we have developed above, we plot the crime patterns in a week for every month in 2017. Note that normalization is still required here. Based on the results, is our strategy in Exercise 7.1 to focus on Friday valid?
res_md = df.groupby(['dayofweek','month'])['ID'].count().reset_index(name="count")
# normalization
date_2017 = pd.DataFrame(
{"dayofweek": pd.date_range("2017-1-1","2017-12-31",freq="D").dayofweek.astype("category"),
"month": pd.date_range("2017-1-1","2017-12-31",freq="D").month } )
date_2017["dayofweek"] = date_2017["dayofweek"].cat.rename_categories(["Mon","Tue","Wed","Thu","Fri","Sat","Sun"])
nd_2017 = date_2017.groupby(['month'])['dayofweek'].value_counts().sort_index().reset_index(name="day_count")
res_md_norm = nd_2017.merge( res_md, how = "left", on = ["month","dayofweek"]).fillna(0)
res_md_norm['count_norm'] = res_md_norm['count']/res_md_norm['day_count']
res_md_norm['dayofweek'] = res_md_norm['dayofweek'].astype("category").cat.reorder_categories(["Mon","Tue","Wed","Thu","Fri","Sat","Sun"])
res_md_norm['month'] = res_md_norm['month'].astype('category').cat.rename_categories(["Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"])
mp = sns.lineplot(data=res_md_norm, x='dayofweek', hue = 'month', y='count_norm',
palette = sns.color_palette("hls",12))
mp = mp.legend(loc='center left', bbox_to_anchor=(1.01, 0.5), ncol=1)
_ = plt.ylabel("Average daily total crimes in a month")
_ = plt.title("Crime patterns across all days in a week in different months")
for i in range(12):
    tmp = res_md_norm[res_md_norm.dayofweek=="Sun"]
    _ = plt.text( 6, tmp['count_norm'].iloc[i], tmp['month'].iloc[i])
Answer. We can immediately spot that not all months have the same day-of-the-week patterns. Friday stands out as the crime rate peak in a week only in the non-summer months. In August, the peak shifts to Saturday. In June, there seems to be no apparent peak across a week. The results here tell us that our "Friday strategy" is probably not going to work during the summer, which unfortunately also happens to be the time of year with the highest crime rate overall.
This section has only explored three pairs of potential interactions across all pairs of variables we have. There are of course many other combinations that are important and will lead to further refinement of our deployment plan. We should be able to identify patterns associated with any pair of variables using the tools we developed here. After taking any interactions we find into account, we can then propose an actionable plan.
So far, we have based our analysis on just the total crime rate. But not all crimes are equally harmful. A homicide case would severely affect a neighborhood even after several years and hinder business development in the area. On the other hand, a case of petty theft is usually not as destructive and would be dismissed after several weeks.
We can define a different type of outcome which emphasizes crime types that need to be controlled to the minimal level. These crimes are generally determined by municipal development plans of Chicago. For example, if the government aims to promote tourism, crimes targeted at tourists, such as theft and deceptive activities, should be the main focus of the police department.
In our dataset, the column IUCR is a reporting code that partially measures the damage of an incident to the general well-being of the public. Let's use this code to define a new type of outcome which roughly measures the accumulated damage of all crimes in an area over a period of time. The key idea is that we should focus on places and times that are harmed by crimes the most, not necessarily the ones with the highest total crime incidents.
df['IUCR'].head()
Some of the IUCR codes have a letter after them (e.g. A, P, B, R, C, T, N). An examination of the above link shows that these letters relate to convictions, which is not too relevant to us. Write code to remove the letters and convert this column to a numeric type. Then use the function hist to visualize the distribution of IUCR, and the function describe to summarize it (mean, variance, etc.). Based on the histogram, do most cases have a large IUCR value (i.e., less severe)?
# This code processes the IUCR column to be truly numeric
iucr = (df['IUCR'].str.replace("A", "")
.str.replace("P","")
.str.replace("B","")
.str.replace("R","")
.str.replace("C","")
.str.replace("T","")
.str.replace("N","")
.str.replace("E","")
.str.replace("H","")
.astype('int'))
_ = iucr.hist()
iucr.describe()
Answer. We can see that most cases have a raw IUCR score smaller than 2000 and around 30% of cases have a raw IUCR score smaller than 500. This means that most cases in our dataset are considered to be moderately severe.
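As a side note, the chain of str.replace calls above can be collapsed into a single regex replacement that strips any uppercase letter; a sketch on made-up codes (the values below are illustrative, not actual IUCR entries):

```python
import pandas as pd

# Strip any uppercase letter from each code in one pass, then convert
# the result to integers.
codes = pd.Series(["0110", "031A", "051B", "1153"])
numeric = codes.str.replace(r"[A-Z]", "", regex=True).astype(int)
print(numeric.tolist())  # [110, 31, 51, 1153]
```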
Given these IUCR scores, how would you integrate them into our crime incidence visualizations so that the final results also represent the severity of crimes in a region?
Answer. In all of our analysis above, each case counts as one towards the total number of incidents. We can amplify the effect of certain types of crimes (e.g. homicide) with a predefined scaling factor, encoding the judgment that a single homicide case is as severe as many counts of a misdemeanor like petty theft. By incorporating these scaling factors, regions with fewer but more severe cases would be flagged as more dangerous than they appear in the unweighted visualization. A rule of thumb is to choose scaling factors that are positive and higher for more severe crimes.
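The effect of such scaling factors can be seen on a toy example (the beats, cases, and weights below are hypothetical, chosen only to illustrate the mechanism):

```python
import pandas as pd

# hypothetical weights: one homicide counts as much as 50 thefts
toy = pd.DataFrame({
    "beat":  ["A", "A", "A", "B"],
    "crime": ["THEFT", "THEFT", "THEFT", "HOMICIDE"],
})
weights = {"THEFT": 1, "HOMICIDE": 50}
toy["w"] = toy["crime"].map(weights)

unweighted = toy.groupby("beat").size()    # beat A leads on raw incident counts
weighted = toy.groupby("beat")["w"].sum()  # beat B leads once severity is weighted in
```

The same incidents produce opposite rankings depending on whether severity is weighted, which is exactly why the choice of scaling factors matters.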
Using the scoring scheme we discussed above, let's consider the monthly crime severity across police beats:
# weighted incidence rate
iucr = iucr.fillna(0)
df["IUCR_num"] = iucr.max() - iucr
beat_cs_month = df.groupby(["beat_num","month"]).aggregate({"IUCR_num": "sum"}).reset_index()
nd = pd.DataFrame({"month":range(1,13), "days":[31,28,31,30,31,30,31,31,30,31,30,31]})
beat_cs_month = beat_cs_month.merge(nd, how = "left", on = "month")
beat_cs_month["severity_tot"] = beat_cs_month["IUCR_num"]/beat_cs_month["days"]
folium_slider( beat_cs_month, beat_orig, list(range(1,13)), "month",
list(pd.date_range( "2017-1", "2017-12", freq = "MS").strftime("%Y-%m")),
value_var = "severity_tot", caption = "Average daily crime severity in Chicago")
It seems that the results here are not significantly different from the results we obtained when we did not incorporate IUCR. This might be because most cases have a relatively small raw IUCR and therefore tend to be relatively equally weighted even when we do incorporate IUCR.
Since IUCR was not particularly useful, let's define a different severity metric. Generally, it is good practice to define a metric that aligns with the bigger-picture goal that the city is trying to achieve. For example, if we want to attract more large companies to open up branches in Chicago, we may care a lot about homicide, sexual assault, and arson, but less about gambling, obscenity, and theft. This type of analysis is rather subjective, but with experience in the field we should gain good intuition about how every type of crime should be bucketed. If we want to be more objective, we could include another dataset of crimes which have been classified by the dollar amount in losses caused by those crimes. We start with a subjective scoring system as an example:
# The zip function pairs each crime type from the first list with the severity number from the second
severity_10 = zip(['CRIM SEXUAL ASSAULT', 'ARSON', 'HOMICIDE'], [10] * 3)
severity_9 = zip(['BATTERY', 'ASSAULT','ROBBERY', 'BURGLARY'], [9] * 4)
severity_8 = zip(['MOTOR VEHICLE THEFT', 'PUBLIC PEACE VIOLATION', 'CRIMINAL DAMAGE'], [8] * 3)
severity_7 = zip(['CRIMINAL TRESPASS', 'OFFENSE INVOLVING CHILDREN', 'KIDNAPPING'], [7] * 3)
severity_6 = zip(['STALKING', 'PUBLIC INDECENCY'], [6] * 2)
severity_5 = zip(['OTHER OFFENSE', 'HUMAN TRAFFICKING'], [5] * 2)
severity_4 = zip(['DECEPTIVE PRACTICE', 'INTIMIDATION'], [4] * 2)
severity_3 = zip(['INTERFERENCE WITH PUBLIC OFFICER', 'SEX OFFENSE'], [3] * 2)
severity_2 = zip(['NARCOTICS', 'WEAPONS VIOLATION', 'CONCEALED CARRY LICENSE VIOLATION','OBSCENITY'], [2] * 4)
severity_1 = zip(['THEFT','GAMBLING', 'PROSTITUTION', 'LIQUOR LAW VIOLATION', 'NON-CRIMINAL (SUBJECT SPECIFIED)', 'NON-CRIMINAL'], [1] * 6)
# By turning these zipped tuples into a dictionary, we can map each row of the dataset to its severity label
severities = dict()
for s in [severity_1, severity_2, severity_3, severity_4, severity_5, severity_6, severity_7, severity_8, severity_9, severity_10]:
severities.update(dict(s))
# Our last step is mapping the Primary Type column using this dictionary
df['severity_bus'] = df['Primary Type'].apply(severities.get)
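One thing to verify before aggregating: any primary type missing from the dictionary comes back as a missing value. A sketch with an abbreviated dict and a hypothetical three-row frame:

```python
import pandas as pd

# abbreviated stand-in for the severities dict built above
severities_demo = {"HOMICIDE": 10, "THEFT": 1}
toy = pd.DataFrame({"Primary Type": ["HOMICIDE", "THEFT", "RITUALISM"]})
toy["severity_bus"] = toy["Primary Type"].map(severities_demo)

# categories missing from the dict become NaN -- better to surface them than to sum silently
unmapped = toy.loc[toy["severity_bus"].isna(), "Primary Type"].tolist()
```

Checking `unmapped` after the mapping step guards against crime types that were added to the dataset but never assigned a severity score.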
This weighting scheme is suitable for visualizing the average per-case severity but not the total severity. The reason is that the range of the weights is rather small (1-10) and crimes with high weights are rare. As a result, the weighted sum given by this scheme is very similar to the unweighted sum, so visualizing it adds little information. The average per-case severity, on the other hand, is bounded between 1 and 10, so a small change (e.g. 0.1) in the average can still produce a visible change in the fill color. Let's take a look at the updated average per-case severity score:
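The total-versus-average distinction is easiest to see on a toy example with hypothetical numbers: two beats whose weighted sums are nearly identical but whose per-case averages differ by an order of magnitude:

```python
import pandas as pd

# hypothetical beats: A has 100 weight-1 cases, B has 10 weight-9 cases
cases = pd.DataFrame({
    "beat": ["A"] * 100 + ["B"] * 10,
    "severity": [1] * 100 + [9] * 10,
})
g = cases.groupby("beat")["severity"]
total = g.sum()    # 100 vs 90: the weighted sums barely differ
average = g.mean() # 1.0 vs 9.0: the per-case average clearly separates the beats
```

A choropleth of `total` would color both beats almost identically, while one of `average` would single out beat B, mirroring the argument above.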
# note that we still need to do normalization by days
beat_cs_month = (df.groupby(["beat_num","month"])
                   .aggregate(severity_bus = ("severity_bus", "sum"),
                              crime_count = ("severity_bus", "count"))
                   .reset_index())
nd = pd.DataFrame({"month":range(1,13), "days":[31,28,31,30,31,30,31,31,30,31,30,31]})
beat_cs_month = beat_cs_month.merge(nd, how = "left", on = "month")
beat_cs_month["severity_bus_tot"] = beat_cs_month["severity_bus"]/beat_cs_month["days"]
beat_cs_month["severity_bus_avg"] = beat_cs_month["severity_bus"]/beat_cs_month["crime_count"]
folium_slider( beat_cs_month, beat_orig, list(range(1,13)), "month",
list(pd.date_range( "2017-1", "2017-12", freq = "MS").strftime("%Y-%m")),
value_var = "severity_bus_avg", caption = "Average crime severity in Chicago")
This map essentially shows the violent parts of Chicago, ignoring theft and putting emphasis on the most violent crimes. Under this prioritization scheme, we see that Downtown Chicago is no longer the worst region; rather, the worst region has now become South Side Chicago. The "safe" North Side is now dotted with some rather violent beats.
We explored Chicago crime records in 2017 to understand crime patterns and proposed preliminary policy deployment strategies based on these patterns. We initially examined the patterns associated with each variable of interest independently. Based on the single variable analyses, we proposed that the police department should put extra forces to work on Fridays between May and August, pay more attention to theft, battery, assault and criminal damage, and deploy more forces in Downtown Chicago and the West and South Sides of Chicago.
We then conducted analyses for three pairs of variables and found that the above strategies are too rigid and don't take interaction effects into account. We found that Friday is a weekly hot spot outside of summertime, but during summertime, either Saturday becomes the hot spot or there is no clear hot spot at all. We also found that the time windows of high criminal activity differ across the three geographic hot spots: Downtown has a high rate throughout the year, whereas crime in the other two regions mostly accumulates between April and August. Finally, we found that theft is common across all kinds of locations, whereas other types of crime, such as battery and assault, tend to cluster in a small number of location types.
In the last part of the analysis, we looked at how to customize weights of different crimes based on the business outcome we were optimizing for. We considered a custom severity score, which highlights the extremely violent cases, and found out that many regions in Chicago have low overall crime incident counts but high violent crime rates. Downtown Chicago, on the other hand, did not harbor violent crimes despite its high overall crime rate. This indicates that if eliminating highly violent crimes is the priority, the previously devised strategies should be changed and there should be more emphasis placed on locations like North Side Chicago.
Moving forward, there are many things we can do. Using this dataset, we can explore other pairwise interaction effects. We can also examine if the strategies we proposed here have already been implemented and whether the deployment plan in use now is successful or not. We can also consider more advanced statistical modeling for our dataset so that all variables can be included, and not just the ones we examined. For those that are interested in this topic, a good starting point is here.
In this case, we learned how to perform exploratory data analysis for records of crime. This type of data does not have as clear of an outcome of interest, so we started by assuming that the outcome of interest is the total number of crime incidents. From there, we followed a similar process as in the last EDA case, except:
Finally, we questioned our original assumption that the number of crime incidents was the most important outcome to optimize for. We also looked at how we ought to weight different crimes differently in our analysis based on the particular business problem at hand.
Case 4.3 was focused on EDA for numerical parameters while Case 4.4 was focused on categorical variables. We simply cannot overemphasize the importance of EDA for any data science problem so we recommend you review both cases rigorously. If you are comfortable with the concepts and technical aspects of both cases, you can find a dataset with a mix of categorical and numerical parameters on a platform like Kaggle for further practice. There is no such thing as too much EDA practice.